{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "If you were not here for Lab 12 and need to install the graphviz package:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "!pip install --user graphviz" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Lab 13 - Decision Trees for regression\n", "\n", "For this lab, we will return to the insurance data from Labs 7 and 8. Recall we are trying to predict the insurance cost, a quantitative value. \n", "\n", "If you don't have the dataset, download it from GitHub: [https://github.com/stedy/Machine-Learning-with-R-datasets/blob/master/insurance.csv](https://github.com/stedy/Machine-Learning-with-R-datasets/blob/master/insurance.csv)\n", "\n", "In this data, each row represents an insurance policy and the 7 columns contain the following information about it:\n", "- age: age of policy holder\n", "- sex: sex of policy holder\n", "- bmi: body mass index (bmi) of policy holder. 
bmi is a (sometimes unreliable) measurement of body fat in adults\n", "- children: number of children (dependents) on the policy\n", "- smoker: whether the policy holder is a smoker\n", "- region: region of the country the policy holder lives in\n", "- charges: price for insurance policy" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import pandas as pd\n", "import matplotlib.pyplot as plt\n", "import numpy as np\n", "from sklearn import tree\n", "import graphviz\n", "from graphviz import Source\n", " \n", "from sklearn.model_selection import train_test_split\n", "\n", "from sklearn.tree import export_graphviz\n", "import sklearn.metrics as met\n", "from sklearn.metrics import confusion_matrix\n", "\n", "%matplotlib inline" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Read the data into a dataframe and display it to make sure it was read in correctly:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Scikit-learn decision trees require numeric data. How can we convert the categorical columns into numeric data? 
\n", "Hint: see Lab 8" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Fitting a decision tree with scikit-learn\n", "\n", "We can get just the independent variables (x's) using the following:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "X = insurance.iloc[:,[0,1,2,4,5,6,7,8]]\n", "X.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Next we create the decision tree variable (object) and then fit it to our data:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "reg = tree.DecisionTreeRegressor(max_depth = 5)\n", "reg = reg.fit(X, insurance[\"charges\"])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "If you are running Jupyter Hub on your own computer, you may be able to display the decision tree by:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "tree.plot_tree(reg)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "If you are using the Jupyter Hub server, run the following code (which will give an error):" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "scrolled": true }, "outputs": [], "source": [ "dot_data = tree.export_graphviz(reg, out_file=None) \n", "graph = graphviz.Source(dot_data) \n", "graph.render(\"insurance.dot\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "However, despite the error, there should now be a file called insurance.dot in your directory. To view the fitted decision tree, open the insurance.dot file in Jupyter and copy the text. Paste this text into the text box at [http://www.webgraphviz.com](http://www.webgraphviz.com) and click the \"Generate graph!\" button at the bottom.\n", "\n", "The column names have been replaced by `X[0], X[1], ..., X[7]`. 
Run the following code to change `X[0], X[1], ..., X[7]` to the column names in insurance.dot." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Replace the generic labels X[0]..X[7] with the column names of X, in order\n", "with open(\"insurance.dot\", \"r\") as fin:\n", "    with open(\"insurance_fixed.dot\", \"w\") as fout:\n", "        for line in fin:\n", "            line = line.replace(\"X[0]\", \"age\")\n", "            line = line.replace(\"X[1]\", \"bmi\")\n", "            line = line.replace(\"X[2]\", \"children\")\n", "            line = line.replace(\"X[3]\", \"sex_male\")\n", "            line = line.replace(\"X[4]\", \"smoker_yes\")\n", "            line = line.replace(\"X[5]\", \"region_northwest\")\n", "            line = line.replace(\"X[6]\", \"region_southeast\")\n", "            line = line.replace(\"X[7]\", \"region_southwest\")\n", "            fout.write(line)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Copy the contents of insurance_fixed.dot into the textbox in [http://www.webgraphviz.com](http://www.webgraphviz.com) to display the decision tree with the column names. How does it compare to the decision tree you made?\n", "\n", "What happens if you change the `max_depth` parameter to 5 in DecisionTreeRegressor?\n", "\n", "Look at the leaves of your new tree. What is the smallest number of samples in a leaf? \n", "\n", "A few of the leaves only have 1 sample. How do you think this tree would work on other insurance data?\n", "\n", "The single samples are a sign of over-fitting, and to fix it we can make `max_depth` smaller (but too small and our model will not be as good as it could be).\n", "\n", "### Testing and training data\n", "\n", "To figure out what `max_depth` should be, let's split our data into training and testing data. 
" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Create a decision tree with `max_depth = 3` from the training data:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Make predictions for the test data:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "scrolled": true }, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Compute the mean squared error for these predictions:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "What is the mean squared error if you use `max_depth = 4`?" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "What is the mean squared error if you use `max_depth = 5`?" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "What about if you use `max_depth = 2`?" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Which `max_depth` parameter should you use? What is the corresponding decision tree?\n", "\n", "You can also use a loop to quickly check the different parameter values for `max_depth`. 
" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "dot_data = tree.export_graphviz(reg_depth3, out_file=None) \n", "graph = graphviz.Source(dot_data) \n", "graph.render(\"insurance_depth3.dot\")" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "with open (\"insurance_depth3.dot\", \"r\") as fin:\n", " with open(\"insurance_depth3_fixed.dot\",\"w\") as fout:\n", " for line in fin.readlines():\n", " line = line.replace(\"X[0]\",\"age\")\n", " line = line.replace(\"X[1]\",\"bmi\")\n", " line = line.replace(\"X[2]\",\"children\")\n", " line = line.replace(\"X[3]\",\"sex_male\")\n", " line = line.replace(\"X[4]\",\"smoker_yes\")\n", " line = line.replace(\"X[5]\",\"region_northwest\") \n", " line = line.replace(\"X[4]\",\"region_southeast\")\n", " line = line.replace(\"X[5]\",\"region_southwest\")\n", " fout.write(line)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Finally, we can compare the mean squared error using a Decision Tree regressor to the mean squared error computed using linear regression in Lab 8, also based on a training/testing split of 0.2. It was 41142821.67547247 (for my training/testing data).\n", "\n", "Which model is better?" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Return to the decision tree classifier from last lab. Which `max_depth` is best?" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.4.8" } }, "nbformat": 4, "nbformat_minor": 2 }